Evolution models with base substitutions, insertions, deletions and selection 
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The evolution model with parallel mutation-selection scheme is solved for the case when selection 
is accompanied by base substitutions, insertions, and deletions. The fitness is assumed to be either 
a single-peak function (i.e., having one finite discontinuity) or a smooth function of the Hamming 
distance from the reference sequence. The mean fitness is exactly calculated in large-genome limit. 
In the case of insertions and deletions the evolution characteristics depend on the choice of reference 
sequence. 
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I. INTRODUCTION 

The existence of insertions and deletions is well estab- 
lished experimentally. There has been a recent consid- 
erable interest in molecular evolution models of Eigen, 
i.e., in connected mutation-selection scheme [1-2], and 
in the parallel, i.e., "decoupled" mutation schemes [3- 
7]. The studies included mean fitness for different fit- 
ness landscapes [4-7], and population distributions un- 
der mutation-selection balance constraint [8]. Since the 
experimental confirmation of insertion and deletion pro- 
cesses [9] there have been several studies of molecular 
evolution models that incorporate base substitutions, in- 
sertions, and deletions [10-12]. In this article we integrate 
a concept of indels with parallel mutation-selection pro- 
cesses to solve our model with general fitness landscape 
and derive exact formula for the mean fitness. 

In biology research the term indcl stands for cither 
insertion alone or deletion alone or both these processes 
present simultaneously. Indels play an important role in 
phylogenic analysis in practical population genetics [13], 
where incorrect handling of indels may give unrealistic 
outcomes. 

In the parallel mutation-selection model any genotype 
configuration i is specified as a sequence of N two- valued 
letters (alleles) s n = ±1, 1 < n < N. We denote such 
configuration i by Si = (s\, . . . , s z N ). Probability pi that 
configuration Si occurs in genome, 1 < i < 2 N , satisfies 



dpi 
dt 



Pi 
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2_j l^iiPji 
3 = 1 



(1) 



where is the fitness, and ^ is the mutation rate from 



Si to Sj per unit time. For the Crow-Kimura model 

[4]: 
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= a if d, 



-aN if the Hamming distance dij is zero, 



1, and jAij = if dij > 1; where di 
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In the models studied here we consider the following 
three independent parallel processes in the genome: base 
substitutions, deletions and insertions. Assuming con- 
stant genome- variation rates per site, we denote a/ 'No, 
b/No, and c/Nq the rates of mutation, insertion, and dele- 
tion, respectively, where Ao >> 1 is the scale length of 
the genome. Unlike in the well-studied cases of the paral- 
lel mutation-selection scheme and the Eigen model, now 
the genome length can be varied. In this paper we focus 
only on symmetric fitness landscape, i.e., when the fitness 
of the genome is a function of Hamming distance from 
one reference genome sequence. The fitness is assumed 
to be either a single-peak function (i.e., having one fi- 
nite discontinuity) or a smooth function of the Hamming 
distance. 

In the first model, analyzed in Scc[TTl we are investi- 
gating indels acting in a toy problem when the reference 
sequence is ordered, i.e., when it contains only one letter 
(either +1 or —1) at all positions. Obtaining solution 
to this toy problem is by no means trivial because nei- 
ther the maximum principle [5] nor the Hamilton- Jacobi 
method [8] can be applied directly. A more realistic 
case is analyzed in Scc lIIII for random reference sequence 
when the letters +1 and —1 arc randomly distributed 
along genome length. For symmetric fitness in parallel 
mutation-selection model without indels the choice of the 
reference sequence does not affect the solution, which is 
a consequence of the existing symmetry of the governing 
equations. The introduction of indels to the model breaks 
this symmetry and the effect of indels acting on sequence 
space is to change the solution. This change depends on 
the choice of the reference sequence. If we choose as the 
reference sequence the one with all + alleles, the result 
of deletion is the same for all the N possible position of 
deleted allele. In case a random reference sequence we 
have different results (sequences) after different positions 
of deleted allele. In our model, an individual indcl event 
means either insertion or deletion of a single letter in the 
genome sequence, one at a time, but there may be many 
indel events during evolution. In this article we focus 
on investigating a "successful selection" phase, i.e., the 
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phase (range of parameters) with majority of population 
being localized around the reference sequence. Our re- 
sults are discussed in SeclIVl 



II. ORDERED REFERENCE SEQUENCE 

We choose the reference sequence that has all the al- 
leles +1 and the initial distribution of sequences that is 
symmetric under permutations. Individual configuration 
is denoted by (N, L) , where N is genome length, and L 
is the number of (-f-l)-alleles in the configuration. The 
fitness is Ngr(N, L). Considering only one- letter deletion 
or insertion at a time there may be three processes that 
start at (N',L') and end at (N,L): 

• Simple base substitutions at the rate of a/Ng. At 
the beginning there are L configurations (N, L — 1) 
and N — L configurations with (N, L + 1). 

• Deletions at the rate of b/No. At the beginning 
there are N + 1 configurations (N + 1, L + 1) and 
N + 1 configurations with (N + l,L). 

• Insertion at the rate of c/No. At the beginning 
there are L configurations (N— 1, L~ 1) and (N — L) 
configurations (N — 1,L). 

During base substitutions the letters (alleles) change 
their signs. During deletion one of the letters disappears. 
During insertion a new letter (either +1 or —1) is added 
randomly at any of the {N+l) positions along the chain. 

There are two interests in solving this model, usu- 
ally treated separately. One interest concerns to genome 
growth [10-12]. The other interest is the study of succes- 
sive selection phase, which we present here for the case 
when selections and base-substitutions are accompanied 
by deletions and insertions. 

The occurrence probability p(N, L) denotes a frac- 
tional number of configurations (N, L) in the popula- 
tion. For symmetric fitness landscape and permutation- 
symmetric initial distribution, probabilities p{N, L) sat- 
isfy the equations [T-ij : 

dp(N,L) = 
dt 

( N N+l 

p(N, L) i N r (N,L)-—(a + b)-c 



{ ±- p{N ,L-l)+p(N,L + l)^ 



+ b [p(N + 1, L + 1) +p(N + 1, L)\ 



N + l 

Nn 



(2) 



+ l(p { N-l,L-l)± + p { N-l,L)^ 

I ' N' 
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FIG. 1: Mean fitness R vs fitness parameter J for a = b = 1, 
and c = 2. Theoretical results are plotted as continuous lines. 
Numerical results are represented by symbols. Error bars give 
percent difference between numerical and theoretical results, 
(a) Single-peak mean fitness for N = 1000. Error bars are 
about 0.05%, smaller than symbol size, (b) Quadratic fitness, 
r = Jm 2 , for N = 200. Error bars are about 0.5%. 



In case of symmetric fitness landscape, for finding steady- 
state mean fitness it is sufficient to consider only sym- 
metric evolution. We solved Eq. ([3]) numerically, varying 
genome length between TVi and N2 subject to N2 — Ni 
1 . The numerical results for two values of genome length 
are presented in Fig. 1. 

Eq. (2) is slightly modified near the border values 
of N, as at A^2 there are only deletions and at A^-only 
insertions. The weighted sum over all equations is zero, 
where the weights (^ ) are numbers of configurations with 
the same N and L. 

For single-peak fitness landscape we set all fitness val- 
ues to zero but one: 



r(No,No) = J, r = otherwise, 



(3) 



and Ao = Ei±E2._ j n continuous-time model, considered 
in this work, fitness landscape Eq.Q can be rescalcd 
by an additive constant, which is a standard procedure 
in statistical physics. In discrete-time models all fitness 
values would have to be positive. For base substitu- 
tions acting alone without indels, with the choice given 
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by Eq.([3]) the system of Eqs.([2]) decouples, which leads 
to a single equation for only one master-type (reference 
sequence) probability [6] from which other probabilities 
are obtained recursively. When base substitutions and 
indcls arc simultaneously acting, for single-peak fitness 
defined by Eq.([3]) the system Eqs.© does not decouple 
but, nonetheless, can be reduced to a tractable problem 
that can be treated analytically. The reduction proce- 
dure is outlined in the next paragraphs. 

We consider the following scaling of Eqs.©: 



p(N,N)~l, 
p(N, L) ~ 1/Nq~ 



(4) 



Since the scaling ((4]) suppresses contributions from all 
terms (N,L) for L < N as 1/N, after the scaling 
we obtain a complete set of equations for the class 
p(N, N),N\ < N < N2 of frequencies. For example, 
for p(N, TV - 1) we derive from Eqs.©: 

p(N, N —1) = -£rp(N, N) + -^-p(N - 1, N - 1), (5) 

which is easily verified to be consistent with the scaling 
ansatz 

Denoting by P a collection of all p(N, N), where Pi = 
p(Ni - 1 + 1, Nx - 1 + 1), we write Eq.© for p(N, N) as 



dP 
~~dt 



= AP- RP, 



where R is the mean fitness, R — Jp(N , N ); and, the 
elements of matrix A are 



An = -(a + b + c) + J5i : i , 
= c/2, 



(6) 



where Iq = No — N\ + 1, and 61 j is Kronecker's symbol. 
In successful-selection phase the majority of population 
is distributed around the configuration (N ,N ). The 
details of matrix A near the borders at I — 1, I = AI + 1, 
and M = N2 — Ni , arc irrelevant for the computation of 
the mean fitness in the successful-selection phase in the 
sense that the mean fitness is insensitive to variations in 
these border values. The mean fitness R is obtained in 
the standard way as the largest eigenvalue of A by solving 
the secular equation 



det (A - \f \ = 0, 



(7) 



where / is the identity matrix, and R = max (A) . 

To calculate R within the 1/ATo-accuracy we utilize the 
properties of determinant and in Eq.©, modify the ma- 
trix A, taking A\^m+\ — b and Am+i,i — c/2. Then, we 
define an auxiliary function g(J) by g(J) = det(A — RI). 
Since g (J) is linear we write 

g(J)=g(0) + Jg'(0), 



where g(0) = det(B — RI); the matrix B is the value of 
the matrix A computed at J = 0; and, </(0) is the first 
derivative of g(J) computed at J = 0. Because matrix 
B is symmetric and cyclic it is relatively straightforward 
to write g(0) and g'(0) explicitly: 

At 

5 (0) = [J (be^ l / M + V^/m _ ( a + 6 + c ) - i?) , 



s'(o) 



1=0 



1 



M 



1 



g(O) M ^ be l2 ^/ M + £ e - i2vl / M -{a + b + c)- R' 

(8) 

In the thermodynamic limit of large Nq, Ni, and N2, 
the infinite summation on the right-hand-side of Eq.jH]) 
becomes a contour integral in complex plane. The left- 
hand-side ofEq.® is flf'(0)/£f(0) = (g(J)/g(0)-l)/J and 
g(J) = because of Eq.((7]). Thus, making the substitu- 
tion z = cxp(i2wl / M) , Eq.© gives the relation between 
the mean fitness R and the fitness J of the peak config- 
uration: 

J fdz 1 

2iri J zbz + ^-(a + b + c)-R 

j (9) 



1 



2bc 



Inverting Eq.@ gives the mean fitness R and fractional 
population P m of the peak configuration: 



R= VJ 2 + 2bc - (a + b + c), 
v 7 J 2 + 2bc -(a + b + c) 



(10) 



P m =p(N ,N ) 
and the error-threshold condition 



J 



J> \/ (a + b + c) 2 - 2bc. (11) 
Results of Eqs.Q and (fT0|) are illustrated in Fig. la. 



A. Nonzero fitness at one N value 
and many L values 

The sharp-peak fitness defined by Eq. ([3]) is an oversim- 
plification as it is believed that realistic fitness landscapes 
are highly complicated and irregular. As a step towards 
generalization we consider now fitness that is nonzero at 
only one N value, set to N = Nq, and at many values 
of L. Here, L is the number of the (+l)-alleles in the 
genome and Hamming distance to the reference configu- 
ration is N — L. For this more general fitness we take 



r(N,L) = 8 N , No f(2L/N -l), 



(12) 



where /(•) is a smooth function. Following a method in- 
troduced by Baake and Wagner 0] we transform Eqs.{T]) 
to a more convenient form with the use of the substitu- 
tion 



y(N,L)=p(N,L) 



L\(N - L)V 



(13) 
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Equations for the weighted fractional populations 
y(N,L) simplify in the large-genome limit. For the com- 
putation of the mean fitness they are easier to handle 
than the original Eqs.([T]). We have checked rigorously by 
calculating the distribution y(N,L) that it is a smooth 
function of L/Nq for the given N = Nq, although it 
is not smooth for all N. Assuming that y(N, L) is a 
smooth function of (2L/N — 1) near N = No and near 
the location Lq(Nq) of its maximum, we replace y(N,L) 
with y(N,L (N )) in the coupled system of equations 
for y(N, L) that was obtained from Eqs.fl]) after applying 
transformation (fT5|) . As described for single-peak fitness, 
this gives a partial decoupling. For the decoupled part 
we have the eigenvalue problem 

Ay = Ay, 

where y t = y(N,L (N )), I = N — Ni + 1, and R = 
max (A). Again, the matrix A is tri-diagonal. In the 
limit of large N the eigenproblem for A gives 

^Vi = Vi 0>l,l o f( m ) - {a + b + c)+ a\J 1 - m 2 J 



+ (yi+ib 



(14) 



V2 



where m = (JV- 2L )/N , l = N - Ni + 1. There is full 
analogy between this problem and the problem already 
solved for single-peak function. Quadratic form for the 
single-peak problem can be obtained from quadratic form 
for the current problem by performing the mapping: b — > 

R- ay/ 1 -to 2 . 



V2 ' u ° V2 

By repeating the steps that lead to Eq.© we derive 



R 



:{ - (a + b + c) + ayj\^' 



+ \J f 2 (m) + be (y/1 + m + y/l- to) 2 "}. 



(15) 

Theoretical results of Eqs. p^|) and (fT5"|) for quadratic fit- 
ness are presented in Fig. lb. 



B. General fitness landscape 



distribution y(n, m) must have a maximum localized at 
N and L(N). Then, if the maximum exists the system of 
equations Eqs.{T|) for fractional populations can be par- 
tially decoupled at configuration (N,L(N)). This leads 
to the algebra problem of finding the largest eigenvalue 
of a matrix, R = max(A). In analogy with Eq.JTJJ, the 
intermediary result is 



Vi 



(Si,i f(n, to) - n(a + b + c) + ayj n 2 - m 2 ^j 



+ (Vi+i + Vi-i) (Vn + m + \Jn - m) 



be 



Finally, the mean fitness R is the largest eigenvalue A 
that is obtained by solving Eqs. (fT"7f : 



R = max {f(n, m) — n(a + b + c) + ay/ n 2 — m 2 



+ (y/n + m + y/n — m)y/bc } 



(18) 



Note, the final result Eq. fTg)) requires finding the maxi- 
mum in two-dimensional space of arguments n and to. 



III. RANDOM REFERENCE SEQUENCE 

In this model a reference sequence contains both +1 
and —1 alleles that are randomly distributed along the 
entire genome length. Single-peak fitness is analyzed 
in the next paragraph, followed by the extension of the 
model to general symmetric fitness. 

When reference configuration contains a number of 
pieces with consecutive I number of all (+l)-allclcs or all 
(— l)-alleles per piece, the mean fitness scales according 
to 



(l/N ) a , where a > 1. 



(19) 



Since for the random reference sequence we have I /Nq <C 
1, Eg. (flT?]) implies that in this case the number of con- 
figurations that return to the initial reference sequence 
is negligible and can be neglected. The final result is 
exactly the same as though there were only base substi- 
tutions with the rate a + b + c: 



For the ordered reference sequence we assume now 
nonzero fitness at many N and L values, i.e., fitness is a 
function of both genome length and the number of (+1)- 
alleles: 



,N 2L — N 



(16) 



We assume that fitness f(n,m) is a smooth function of 
both its arguments, i.e., it is also smooth in genome 
length. This is in contrast to the model of Sec lII Al where 
the fitness may change drastically even within one unit of 
genome length. As in the previous two examples, to find 
the mean fitness R it is requested in our approach that 



R = J - {a + b + c), 
and error threshold condition is 

J > {a + b + c) 



(20) 



(21) 



For continuous-time Eigen's model, where r\ = A = 
const and n = l,i > 1, the presence of indcls modifies 
error threshold to: 



QA > 1, 



(22) 



where Q is the probability of errorless reproduction of 
the entire genome. 
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In general-symmetric-fitness model the fitness is de- 
fined by Eq. (|16j) . where we replace L by a parameter 
N — d that explicitly contains Hamming distance d be- 
tween the current configuration and reference configura- 
tion. Here, for any genome length N there is its reference 
configuration, thus, the Hamming distance d is the sec- 
ond independent parameter in the definition of general 
fitness: 



(23) 



where /(•,•) is assumed to be a smooth function of two 
parameters. By the method outlined in Sec|TT]we obtain 
the following general result for the mean fitness 



R = max 

n>m 



|/(n, m) — n(a + b + c) + a \J n 2 — m 2 j 



(24) 



IV. DISCUSSION 



The toy model with ordered reference sequence and 
single-peak fitness, studied in Scc[lll is an interesting 
case from methodological point of view. It describes the 
evolution when there is a desperate difference between 
genome length N and the number L of (+l)-alleles in 
fractional-population distribution. Genotype frequencies 
for this class are smooth functions of L in the neighbor- 
hood of the peak distribution. Error-threshold condition, 
Eq. (|ll|) . depends on all the rates in a nonlinear fashion 
because of the existence of reversal processes introduced 
by indels, which processes may drive the evolution to- 
wards the reference sequence. This picture is unlike to 
what we learn from Crow-Kimura (parallel) model with- 
out indels, where error threshold depends linearly on the 
rates. In the generalized model with single-peak fitness, 
studied in Scc lIIIl where the reference sequence is a ran- 
dom sequence of +1 and —1 alleles, the error-threshold 
formula, Eq. (f2~Tj) . simplifies again to that for the Crow- 
Kimura model with an efficient base substation rate. 

General symmetric-fitness model with random refer- 
ence sequence, analyzed in Sec lIII| presents most real- 
istic situation where both genome length and reference 
sequence are allowed to vary. In this work we investigated 
only steady-state characteristics of this model. 

In our computational approach we used standard 
methods of linear algebra to partially decouple a sys- 



tem of evolutionary equations around the peak distribu- 
tion and find the mean fitness as the largest eigenvalue 
of the decoupled subsystem. As the excellent agreement 
between our analytical and numerical solutions demon- 
strates (see Fig. 1) our methodology has a promise of be- 
coming a routine approach in solving general evolution- 
ary problems with varying genome length, alongside the 
methods of quantum mechanics [i[ that are good when 
genome length is fixed, and the methods of quantum field 
theory [11, [lj]. 

Haploid models with indels that were studied in this 
work can be extended to diploid evolution models with 
parallel insertions and deletions. Similar complex mod- 
els were already considered to study the evolution of gene 
families via conversion processes [17| and gene crossover 
processes [l8|. The latter mentioned mechanisms could 
be in principle handled analytically by modern methods 
fl9| that proved to be successful in treating diploid evo- 
lution da [m. 

In evolution research the models often ignore cither 
selection processes [10-12] or indels [2-7], however, it is 
generally accepted that the concurrent selection and in- 
dels play an important role in biology. The models that 
are capable to describe these two processes as acting si- 
multaneously could give connection with the phylogcny 
analysis and with the investigation of gene families. Our 
introductory study of this work shows that it is possible 
to analytically derive some class of results when selection 
is accompanied by indels. 

In summary, in this work we introduced a method that 
allows one to investigate a broad class of evolution mod- 
els. We solved parallel mutation-selection model with 
general symmetric-fitness landscapes in case of simulta- 
neously acting base substitutions, insertions and dele- 
tions. Our findings indicate that in the steady state of 
this evolution model the mean fitness depends strongly 
on the choice of the reference sequence. 
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